Introduction

In 2020, the entire world was shaken by the global SARS-CoV-2 pandemic. The first case was reported in China in mid-December 2019, and an animal and seafood market in Wuhan, China, is indicated as the source of the disease. Many countries introduced a state of emergency to limit the spread of the virus. Unfortunately, due to the many interstate connections, the virus spread very quickly to many continents.

The SARS-CoV-2 virus causes the COVID-19 disease, whose symptoms are very similar to those of the seasonal flu. The virus affects the respiratory organs, mainly the lungs. The disease is most dangerous for the elderly and for people with so-called concomitant diseases (e.g. diabetes, lung diseases, cardiovascular diseases). Common symptoms of coronavirus infection are:
  • High fever
  • Cough
  • Dyspnea

The disease may lead to complications, e.g. pneumonia or acute respiratory distress syndrome.

To date (November 2020) there is no vaccine and no effective antiviral drug. Treatment is symptomatic, combined with supportive therapy (in the case of respiratory disorders). To counteract the spread of the disease, frequent hand washing and surface disinfection are recommended.

The economy was significantly affected by the pandemic. Many countries decided to implement a so-called lockdown, a temporary shutdown of the economy to avoid gatherings of people and further infections. The education system also suffered: online learning was introduced in many schools and universities.

About the document

The following document is a coronavirus case study based on this article, published on May 14, 2020. The data concern 375 patients of Tongji Hospital in Wuhan, China. In the conducted analysis, the most discriminating biomarkers of patient mortality were identified using machine learning tools. The problem was defined as a classification task, where the input data included blood samples and laboratory test results.

Used libraries

This report uses the following R libraries:
  • dplyr
  • ggplot2
  • readxl
  • knitr
  • tidyr
  • lubridate
  • plotly
  • gtsummary
  • gganimate
  • caret
  • pROC
  • ModelMetrics
  • ggraph
  • igraph
  • bibtex

Dataset - description

The data set is imported from a flat file.

cov_cs_df <- read_excel("data\\wuhan_blood_sample_data_Jan_Feb_2020.xlsx")

The dataset has 6,120 rows and 81 columns, too many to show them all. Below are only a few of the columns:

kable(head(cov_cs_df[1:10], 5))
PATIENT_ID RE_DATE age gender Admission time Discharge time outcome Hypersensitive cardiac troponinI hemoglobin Serum chloride
1 2020-01-31 01:09:00 73 1 2020-01-30 22:12:47 2020-02-17 12:40:09 0 NA NA NA
NA 2020-01-31 01:25:00 73 1 2020-01-30 22:12:47 2020-02-17 12:40:09 0 NA 136 NA
NA 2020-01-31 01:44:00 73 1 2020-01-30 22:12:47 2020-02-17 12:40:09 0 NA NA 103.1
NA 2020-01-31 01:45:00 73 1 2020-01-30 22:12:47 2020-02-17 12:40:09 0 NA NA NA
NA 2020-01-31 01:56:00 73 1 2020-01-30 22:12:47 2020-02-17 12:40:09 0 19.9 NA NA
As you can see, the first 7 columns present information on the patients, while the others describe their blood test results. Below you will find a brief explanation of each column.
  1. PATIENT_ID - identifier of the patient; there are 375 unique patients in the dataset
  2. RE_DATE - date of the test; the first test was performed on 2020-01-10 19:45:00 and the last on 2020-02-18 17:49:00
  3. age - age of the patient; the youngest patient was 18 and the oldest was 95
  4. gender - sex of the patient
  5. Admission time - date of patient admission; the first patient was admitted on 2020-01-10 15:52:20 and the last on 2020-02-17 21:30:07
  6. Discharge time - date of patient discharge; the first patient was discharged on 2020-01-23 09:09:23 and the last on 2020-01-23 09:09:23
  7. outcome - whether the patient survived or died
  8. Hypersensitive cardiac troponin I - cardiac troponins are proteins that are part of the heart muscle cells; under normal conditions their concentration is close to zero, but it rises after a heart attack
  9. hemoglobin - it is responsible for the red color of blood and its primary function is to transport oxygen; normal hemoglobin ranges for adults are:
    • women - 12.0-16.0 g/dl (7.2-10.0 mmol/l)
    • men - 14.0-18.0 g/dl (7.8-11.3 mmol/l)
    • pregnant women - 11-14 g/dl (6.9-8.8 mmol/l)
  10. Serum chloride - chlorine is the major anion of the extracellular fluid, including blood plasma; the proper concentration of chlorides in blood ensures the proper functioning of the neuromuscular and digestive systems; the normal concentration of chloride in blood is 95-105 mmol/l, and the average concentration in women is slightly higher (by around 2-2.5 mmol/l) than in men
  11. Prothrombin time - a parameter describing the efficiency of the extrinsic coagulation system; the normal result is 12-16 seconds
  12. procalcitonin - PCT is a substance whose presence in the blood indicates a bacterial infection; testing procalcitonin levels allows early diagnosis of an infection, even before any symptoms appear; in a healthy individual the concentration is low (<0.1 ng/mL), while a value > 0.5 ng/mL indicates a bacterial infection
  13. eosinophils(%) - a type of white blood cell (leukocyte); their main task is to participate in the immune response of the body; the appropriate percentage of eosinophils among leukocytes, depending on the adopted standards, ranges from 1-5%
  14. Interleukin 2 receptor - a cytokine (protein) that influences, among other things, the activity of lymphocytes
  15. Alkaline phosphatase - known as ALP, an enzyme that occurs in many cells of the human body; depending on where it occurs it reaches different concentrations; the norms are established based on the age of the patient
  16. albumin - the main serum protein, produced in the liver; its main role in the human body is to transport hormones (e.g. cortisol), drugs (e.g. antibiotics) as well as vitamins, fatty acids and lipids; the norms depend on e.g. the age of the patient, gender and determination method; approximate norms for particular periods of life:
    • children (not preterm): 4.6-7.4 g/dl
    • 7-19 years: 3.7-5.6 g/dl
    • adults: 3.5-5.5 g/dl
  17. basophil(%) - one of the morphotic components of blood, making up only about 1% of leukocytes (white blood cells); they are involved in the body's defense against penetrating microorganisms; it is assumed that basophils should be up to 1% of all leukocytes
  18. Interleukin 10 - a cytokine (protein) whose main function is to block the inflammatory process
  19. Total bilirubin - direct and indirect bilirubin together form total bilirubin, a yellow pigment that is a product of the breakdown of red blood cells; the norm for total bilirubin is 0.2-1.1 mg% (3.42-20.6 µmol/l)
  20. Platelet count - thrombocytes, or platelets, are, next to white and red blood cells, one of the blood cell types; platelets play a key role in the clotting process; the normal platelet count is 150-400 thousand per µl
  21. monocytes(%) - white blood cells, the largest cells in our bloodstream; they have, among others, the ability to phagocytose bacteria and to produce various mediators of the immune response, such as interferon; the norm for adults is 4-8% of all leukocytes
  22. antithrombin - an antigen synthesized mainly in the liver, the endothelium of blood vessels, megakaryocytes and platelets; in a healthy person the plasma level is 20 to 29 IU/ml with an activity of 75-150%; antithrombin is the main inhibitor of plasma thrombin and is therefore used to assess the state of the coagulation system
  23. Interleukin 8 - a cytokine that stimulates the migration of immune cells in the body, i.e. the movement and spread of T lymphocytes, neutrophils and monocytes; this action is defensive in nature
  24. indirect bilirubin - a part of total bilirubin; the norm for indirect bilirubin is 0.2-1.0 mg/dl
  25. Red blood cell distribution width - this blood count indicator describes the volume differences between individual red blood cells in a patient; the values are given in femtoliters (fl); generally, 36-47 fl is considered the standard
  26. neutrophils(%) - the most numerous group of white blood cells of the immune system; the task of neutrophils is to protect the body against infections and diseases (they provide so-called cellular immunity); both a low blood level and an excess can indicate many serious diseases; the norm for neutrophils is 60-70% of all white blood cells
  27. total protein - a laboratory test that measures the concentration of all proteins in the blood; protein is an important component of plasma: it maintains adequate pressure inside blood vessels, transports nutrients, and is involved in the coagulation processes and in the defense of the body; the correct level of total protein is between 60 and 80 g/l, or between 6 and 8 g/dl
  28. Quantification of Treponema pallidum antibodies - Treponema pallidum is the bacterium that is the etiological factor of the venereal disease syphilis
  29. Prothrombin activity - a protein factor responsible for the formation of thrombin; the so-called Quick's Index, i.e. the percentage of the norm; the correct result is in the range of 70-130%
  30. HBsAg - an antigen, a surface protein; its presence may indicate hepatitis B or carriage of HBV; the presence of HBsAg is checked in the blood serum
  31. mean corpuscular volume - known as MCV, an indicator showing the volume of red blood cells (erythrocytes); the reference range for MCV is assumed to be 82-92 fl
  32. hematocrit - the ratio of red blood cell volume to total blood volume, expressed as a percentage; the results depend primarily on age, sex, study population, the physical effort the patient is engaged in, and even the method of determination; sample norms for the sexes are as follows:
    • women: 36.1-44.3%
    • men: 40.7-50.3%
  33. White blood cell count - leukocytes are white blood cells; in the human body, leukocytes play a very important role in the functioning of the immune system - they protect against cancer and infections; an example norm for blood leukocytes in adults is 4-10 thousand/μl
  34. Tumor necrosis factor α - a cytokine (protein) associated with the inflammatory process, produced mainly by active monocytes and macrophages, and in much smaller amounts by other tissues; the primary role of the tumor necrosis factor in the body is to modulate the inflammatory response; the norm for TNF-α is under 16 pg/ml
  35. mean corpuscular hemoglobin concentration - the norms for mean corpuscular hemoglobin concentration are between 19.2 and 23.6 mmol/l
  36. fibrinogen - a protein involved in the blood clotting process; the norm is 2-5 g/l
  37. Interleukin 1β - a cytokine that is crucial in the process of inflammation; it is produced in response to various types of antigens; the factors that stimulate its production can be bacteria, viruses or fungi; it acts as a universal stimulant of the inflammatory response and also has the ability to stimulate cells to produce other pro-inflammatory cytokines
  38. Urea - the final product of protein metabolism in the body and as such an indicator of kidney function; the appropriate standard of urea concentration is 2.5-6.7 mmol/l
  39. lymphocyte count - lymphocytes are a group of leukocytes (white blood cells); they protect us against infections and the development of cancer; the norm in an adult is between 1000 and 5000; both their excess and their shortage may have serious consequences
  40. PH value - is a test that is ordered to confirm respiratory disorders; norms for pH (normal arterial blood: 7.35-7.45; venous blood: 7.32-7.42)
  41. Red blood cell count - the norms of erythrocytes in adults:
    • women: 4.2-5.4 million/mm³
    • males: 4.5-5.9 million/mm³
  42. Eosinophil count - the normal number of eosinophils in peripheral blood is 35-350 in 1 mm³ (mean is 125)
  43. Corrected calcium - calcium is a macronutrient element - its content in the adult human body is about 20-25 g/kg of lean body mass; calcium formula corrected calculates the theoretical concentration of calcium in a patient if the serum albumin concentration was 40 g/l; the reference values are 8.8-10.6 mg/dl
  44. Serum potassium - the test is designed to determine the concentration of potassium in the blood serum; the normal level is 3.5-5.0 mmol/l
  45. glucose - a simple sugar that is the primary source of energy in the human body; it is needed for human organs to function properly; the blood sugar value depends on the patient's age; for adults the norm is 3.9-5.5 mmol/l
  46. neutrophils count - the reference values for the amount of neutrophils in the blood are 1800-8000/μl - that is the number of cells per microliter of blood tested
  47. Direct bilirubin - direct bilirubin is a part of total bilirubin; the norm between 1.7-6.8 µmol/l
  48. Mean platelet volume - normally, the value of average platelet volume ranges between 9 and 12.6 μm³
  49. ferritin - an acute phase protein (its level increases when inflammation develops), found in the bone marrow, liver, spleen, kidneys and skeletal muscles; ferritin is a specific store of iron that protects the body in the event of increased demand for this element and protects against its excess; in women, the level of ferritin should not exceed 200 mcg/l, in men 400 mcg/l, and a result below 12 mcg/l indicates a deficiency
  50. RBC distribution width SD - an indicator of red blood cell volume distribution; erythrocytes, or red blood cells, are not identical, and the task of the RDW-SD index is to assess the differences in the volume of erythrocytes in the examined person; the RDW-SD norm is 36-47 fl (femtoliters)
  51. Thrombin time - thrombin time (TT) is a diagnostic test that allows a partial assessment of the coagulation system; on the basis of TT, it can be determined how long the final stage of blood clotting takes, i.e. the conversion of fibrinogen to fibrin; the correct value of the thrombin time should be in the range from 12 to 24 seconds
  52. (%)lymphocyte - normal percentage value of the lymphocyte is between 10-45% all white blood cells
  53. HCV antibody quantification - measurement of the amount of ribonucleic acid (RNA) of the hepatitis C virus (HCV) in the blood; an early response to treatment should be demonstrated by a fall in viral load greater than 2 logs after the first 12 weeks of treatment
  54. D-D dimer - D-dimers are a product of the breakdown of fibrin, a protein precipitated from the blood plasma during the coagulation process; the concentration of D-dimers in the plasma increases during increased production of fibrin, which is related to the formation of clots that impede proper blood flow; the concentration of D-dimers is considered normal below 500 μg/l
  55. Total cholesterol - produced in the liver, supplied to the body with food, and released into the bloodstream; it is needed, among others, for the proper functioning of the nervous system; its excess may unfortunately increase the risk of a heart attack and stroke; for total cholesterol, it is assumed that correct values should be below 200 mg/dl
  56. aspartate aminotransferase - an intracellular enzyme found in the liver, heart muscle, kidneys and red blood cells; it enters the blood when there is cell damage; if a biochemical test shows increased activity, it usually means that we are dealing with one of the liver diseases; the norm for aspartate aminotransferase is in the range from 5 to 40 U/l
  57. Uric acid - it is an organic chemical compound that is one of the end products of metabolism; sometimes it accumulates in large amounts in the blood in the course of some metabolic diseases; the correct level of uric acid should be 180-420 µmol/l
  58. HCO3- - is a test of plasma bicarbonate concentration; the standard is 22-28 mmol/l
  59. calcium - calcium is present in the serum in ionized Ca+2 and bound form; decreased levels of calcium in the body (hypocalcaemia) are manifested by, among others, muscle cramps, joint pain, tingling and numbness; normal total calcium levels are 2.12-2.62 mmol/l
  60. Amino-terminal brain natriuretic peptide precursor(NT-proBNP) - a cardiac marker; NT-proBNP is tested when heart failure is suspected; during myocardial infarction, a significant increase in NT-proBNP is observed; the norm for NT-proBNP plasma concentration is 68-112 pg/ml
  61. Lactate dehydrogenase - an enzyme that occurs in all cells of the human body; in the event of cell damage, lactate dehydrogenase is released from inside them, and its concentration and activity in the blood increase; the normal blood lactate dehydrogenase activity level is <480 IU/l
  62. platelet large cell ratio - the percentage of large and giant platelets in the blood sample; this parameter indicates whether the patient's bloodstream contains platelets that are much larger than normal; the reference value for P-LCR is less than 30% large platelets
  63. Interleukin 6 - is a signaling molecule belonging to the group of cytokines; is responsible for initiating and developing the inflammatory response in the body
  64. Fibrin degradation products - the degradation products of fibrinogen and fibrin are fragments of fibrin formed as a result of the intensification of the process of fibrinolysis, the breakdown of intravascular clots; they are formed by the action of plasmin on fibrin and fibrinogen; the norm is a result below 800 ng/ml
  65. monocytes count - 30 to 800 monocytes per microliter of blood are considered normal
  66. PLT distribution width - determines what percentage of all thrombocytes has different volume than a medium-sized thrombocyte; for PLT distribution width a value from 6.1 to 11 fl is normal
  67. globulin - is responsible in the human body for modulating immune processes; the norm should be from 5 to 15 g/l
  68. γ-glutamyl transpeptidase - (GGTP) is a membrane enzyme present in many tissues and body fluid of humans; the norm of GGTP activity in the blood is:
    • women: <35 IU/l
    • men: <40 IU/l
  69. International standard ratio - expresses prothrombin time, which is one of the most important parameters in the examination of the coagulation system; under normal conditions, it is assumed that the norm is 0.8-1.2
  70. basophil count(#) - basophils are one of the morphotic components of the blood; they take part in the body’s defense against penetrating microorganisms; it is assumed that basophils should constitute 0-300 cells per microliter of blood
  71. 2019-nCoV nucleic acid detection - test that finds the presence of nucleic acid in the patient’s blood
  72. mean corpuscular hemoglobin - is an indicator of the average concentration of hemoglobin in the red blood cell; this parameter allows to assess the functionality of red blood cells in our body; the norm is 19.2-23.6 mmol/l
  73. Activation of partial thromboplastin time - it is a test that measures the coagulation time of the plasma in the presence of kephalin and calcium ions after activation with kaolin clay; the test is performed when a deficiency of coagulation factors and fibrinogen is suspected; the norm is about 26-46 second
  74. High sensitivity C-reactive protein - a C-reactive protein, i.e. one of the so-called acute phase proteins, appearing in the blood as a consequence of inflammation; it is produced under the influence of inflammatory cytokines; tests are performed in cases where the risk of bacterial infection increases or there is such a suspicion based on existing symptoms; in a healthy person, the concentration is not high, not more than 5 mg/l
  75. HIV antibody quantification - determination of HIV-1 genetic material in blood, useful in assessing the effectiveness of antiviral therapy and in the prognosis of AIDS development
  76. serum sodium - the indicator for the test is suspected reduced concentration (so-called hyponatremia) or increased sodium concentration (so-called hypernatremia); the correct concentration is 135-145 mmol/l
  77. thrombocytocrit - the ratio of thrombocyte volume to plasma; deviations may indicate a blood coagulation disorder
  78. ESR - is the rate of descent of red blood cells over time; ESR norm depends on age and gender; ESR norms in blood test:
    • newborns: 0-2 mm/h
    • infants (from 6 month of age): 12-177 mm/h
    • women under 50: 6-11 mm/h
    • women over 50: up to 30 mm/h
    • males under 50: 3-8 mm/h
    • men over 50: up to 20 mm/h
  79. glutamic-pyruvic transaminase - the enzyme most commonly found in the liver, but also present in skeletal muscle, heart muscle and kidneys; in a healthy person whose liver works properly, its level should be insignificant; a result considered normal should not exceed 35-40 IU/l
  80. eGFR - colloquially called glomerular filtration; this indicator is a measurement of the amount of blood that gets filtered by the kidneys; normally the result should be greater than or equal to 90 ml/min/1.73 m²
  81. creatinine - is a substance that is formed in our body as a result of metabolic changes from creatine phosphate by non-enzymatic breakdown of this compound; its concentration in blood and urine gives us information about the efficiency of our kidneys; the correct level of creatinine in the blood serum ranges from 53 to 115 μmol/l
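With this many laboratory columns, it helps to quantify how sparse each of them is before cleaning. A minimal base-R sketch of such a check, run here on a small hypothetical data frame (on the real data the same call would be applied to cov_cs_df):

```r
# Count the missing values per column; on the real data this would be
# colSums(is.na(cov_cs_df)). The toy data frame below is illustrative only.
toy_df <- data.frame(
  PATIENT_ID       = c(1, NA, NA, 2, NA),
  hemoglobin       = c(NA, 136, NA, 125, NA),
  "Serum chloride" = c(NA, NA, 103.1, NA, 99.8),
  check.names = FALSE
)
na_per_column <- colSums(is.na(toy_df))
na_per_column
```

Columns with a very high share of NA values are candidates for dropping rather than imputation.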

Analysis

Introductory Part

Cleaning Process

That shows us that the dataset is hard to read. First, we need to clean the data to enhance the readability of the dataset.

data_df <- cov_cs_df %>%
              mutate(gender = as.factor(ifelse(gender==1, "male", "female"))) %>%
              mutate(outcome = as.factor(ifelse(outcome == 0, "survived", "died"))) %>%
              rename(admission_time = 'Admission time',
                     discharge_time = 'Discharge time')

colnames(data_df)[34] <- "Tumor_necrosis_factor_alfa"
colnames(data_df)[37] <- "Interleukin_1_Beta"
colnames(data_df)[68] <- "gamma_glutamyl_transpeptidase"

patients_df <- data_df %>%
                select(PATIENT_ID, age, gender, admission_time, discharge_time, outcome) %>%
                drop_na(PATIENT_ID) %>%
                # difftime(units = "days") already returns the stay length in
                # days, so it is kept as a plain number
                mutate(hospitalization_length = as.numeric(difftime(discharge_time,
                                                                    admission_time,
                                                                    units = "days"))) %>%
                relocate(hospitalization_length, .after = discharge_time)
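The difftime() call above measures the length of stay in days; a quick base-R sanity check, using the first patient's timestamps from the table shown earlier:

```r
# Length of stay for the first patient in the dataset, in days
admission <- as.POSIXct("2020-01-30 22:12:47", tz = "UTC")
discharge <- as.POSIXct("2020-02-17 12:40:09", tz = "UTC")
stay_days <- as.numeric(difftime(discharge, admission, units = "days"))
round(stay_days, 1)  # about 17.6 days
```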

data_df <- data_df %>% 
            fill(PATIENT_ID)

After replacing the NA values in the PATIENT_ID column, the dataset looks like below.

kable(head(data_df[1:8], 10))
PATIENT_ID RE_DATE age gender admission_time discharge_time outcome Hypersensitive cardiac troponinI
1 2020-01-31 01:09:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 01:25:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 01:44:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 01:45:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 01:56:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived 19.9
1 2020-01-31 01:59:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 02:09:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-01-31 06:44:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-02-04 19:42:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA
1 2020-02-06 09:14:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived NA

We know that most of the patients had multiple blood samples taken throughout their hospital stay. To prepare a model predicting patient mortality, the next step of the cleaning process is to extract only one observation per patient. Only the final sample data will be used for model training and testing.

data_df <- data_df %>%
            group_by(PATIENT_ID) %>%
            fill(colnames(data_df)) %>%
            summarise(across(everything(), last), .groups="drop")

Again, you can see that there are NA values in some columns, which need to be replaced. It is possible to do this in many ways, but here the NA values in each column will be replaced by the median of the values in that column.

# Replace the remaining NA values in the laboratory columns (8:ncol)
# with each column's median
for (i in 8:ncol(data_df)) {
  val <- data_df[[i]]
  val[is.na(val)] <- median(val, na.rm = TRUE)
  data_df[[i]] <- val
}
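Equivalently, the loop can be packaged as a reusable base-R helper; the function name and the toy data frame below are illustrative, not part of the original analysis:

```r
# Median-impute every column of a data frame in one pass (base R sketch)
median_impute <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

toy <- data.frame(lab_a = c(1, NA, 3), lab_b = c(NA, 10, 30))
imputed <- median_impute(toy)
```

On the real data this helper would be applied only to the laboratory columns, since the identifier, date and outcome columns should not be imputed.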

After that, the dataset is clean and looks like below.

kable(head(data_df[1:8], 10))
PATIENT_ID RE_DATE age gender admission_time discharge_time outcome Hypersensitive cardiac troponinI
1 2020-02-17 08:31:00 73 male 2020-01-30 22:12:47 2020-02-17 12:40:09 survived 19.9
2 2020-02-17 15:34:00 61 male 2020-02-04 21:39:03 2020-02-19 12:59:01 survived 1.9
3 2020-02-06 23:15:00 70 female 2020-01-23 10:59:36 2020-02-08 17:52:31 survived 12.3
4 2020-02-17 08:31:00 74 male 2020-01-31 23:03:59 2020-02-18 12:59:12 survived 4.8
5 2020-02-18 09:35:00 29 female 2020-02-01 20:59:54 2020-02-18 10:33:06 survived 5.6
6 2020-02-04 13:59:00 81 female 2020-01-24 10:47:10 2020-02-07 09:06:58 survived 19.7
7 2020-02-06 23:14:00 40 male 2020-01-20 17:54:43 2020-02-08 16:21:46 survived 2.4
8 2020-02-17 13:21:00 47 male 2020-01-30 13:42:08 2020-02-18 11:45:27 survived 3.6
9 2020-02-17 21:59:00 38 female 2020-01-28 23:06:46 2020-02-18 16:46:01 survived 12.3
10 2020-02-17 08:53:00 30 male 2020-02-01 19:50:17 2020-02-19 12:12:57 survived 3.3

Summary

Below is a short summary of the clean dataset.

tbl_summary(
  data_df,
  by = outcome,
  label = gender ~ "Gender") %>%
  add_n() %>%
  modify_header(label = "") %>%
  add_overall() %>%
  bold_labels()
Overall, N = 375¹ N died, N = 174¹ survived, N = 201¹
PATIENT_ID 188 (94, 282) 375 288 (245, 332) 101 (51, 151)
age 62 (46, 70) 375 69 (62, 77) 51 (37, 62)
Gender 375
female 151 (40%) 48 (28%) 103 (51%)
male 224 (60%) 126 (72%) 98 (49%)
Hypersensitive cardiac troponinI 12 (4, 32) 375 34 (12, 254) 5 (2, 12)
hemoglobin 125 (113, 137) 375 122 (109, 137) 126 (117, 137)
Serum chloride 102 (100, 105) 375 103 (100, 111) 102 (100, 103)
Prothrombin time 14.3 (13.4, 16.2) 375 16.3 (14.9, 18.8) 13.6 (13.1, 14.1)
procalcitonin 0.10 (0.04, 0.31) 375 0.32 (0.11, 0.93) 0.04 (0.02, 0.10)
eosinophils(%) 0.25 (0.00, 1.40) 375 0.00 (0.00, 0.10) 1.30 (0.50, 2.10)
Interleukin 2 receptor 664 (594, 752) 375 664 (664, 1,182) 664 (460, 664)
Alkaline phosphatase 71 (54, 96) 375 91 (68, 129) 61 (50, 75)
albumin 33.2 (28.6, 37.5) 375 28.1 (24.6, 31.7) 36.9 (33.4, 39.3)
basophil(%) 0.20 (0.10, 0.40) 375 0.10 (0.10, 0.20) 0.30 (0.20, 0.50)
Interleukin 10 5.2 (5.0, 6.6) 375 5.2 (5.2, 11.8) 5.2 (5.0, 5.2)
Total bilirubin 11 (7, 16) 375 13 (10, 22) 8 (6, 12)
Platelet count 192 (118, 256) 375 114 (59, 184) 238 (192, 297)
monocytes(%) 6.2 (3.2, 8.7) 375 3.0 (2.0, 5.4) 8.3 (6.9, 10.1)
antithrombin 87 (86, 88) 375 87 (76, 87) 87 (87, 93)
Interleukin 8 15 (13, 18) 375 15 (15, 33) 15 (8, 15)
indirect bilirubin 5.3 (3.9, 7.8) 375 5.5 (4.3, 8.5) 5.1 (3.7, 7.2)
Red blood cell distribution width 12.75 (12.10, 13.70) 375 13.20 (12.62, 14.60) 12.40 (11.90, 13.00)
neutrophils(%) 78 (63, 92) 375 92 (87, 95) 64 (55, 72)
total protein 66 (62, 70) 375 63 (58, 68) 68 (65, 71)
Quantification of Treponema pallidum antibodies 0.050 (0.040, 0.060) 375 0.050 (0.050, 0.060) 0.050 (0.040, 0.060)
Prothrombin activity 86 (68, 97) 375 67 (53, 80) 94 (87, 103)
HBsAg 0.010 (0.000, 0.010) 375 0.010 (0.010, 0.010) 0.010 (0.000, 0.010)
mean corpuscular volume 90.4 (87.2, 93.8) 375 90.4 (87.6, 96.0) 90.3 (87.0, 92.5)
hematocrit 36.3 (33.4, 39.9) 375 35.9 (32.4, 39.9) 36.9 (34.1, 39.9)
White blood cell count 8 (5, 13) 375 12 (8, 17) 6 (5, 8)
Tumor_necrosis_factor_alfa 8.3 (7.7, 8.8) 375 8.3 (8.3, 11.4) 8.3 (6.8, 8.3)
mean corpuscular hemoglobin concentration 342 (332, 349) 375 342 (329, 349) 342 (334, 349)
fibrinogen 4.22 (3.55, 5.10) 375 4.22 (3.01, 5.28) 4.22 (3.64, 5.02)
Interleukin_1_Beta 5.00 (5.00, 5.00) 375 5.00 (5.00, 5.00) 5.00 (5.00, 5.00)
Urea 5 (4, 11) 375 11 (6, 18) 4 (3, 5)
lymphocyte count 0.99 (0.54, 1.52) 375 0.52 (0.32, 0.80) 1.46 (1.08, 1.80)
PH value 6.00 (6.00, 6.50) 375 6.00 (6.00, 6.50) 6.00 (6.00, 6.50)
Red blood cell count 4.10 (3.58, 4.62) 375 4.10 (3.54, 4.60) 4.11 (3.61, 4.63)
Eosinophil count 0.02 (0.00, 0.08) 375 0.00 (0.00, 0.01) 0.07 (0.03, 0.11)
Corrected calcium 2.37 (2.26, 2.44) 375 2.35 (2.25, 2.42) 2.38 (2.28, 2.45)
Serum potassium 4.43 (4.06, 4.79) 375 4.43 (3.93, 5.04) 4.43 (4.17, 4.68)
glucose 6.5 (5.2, 9.4) 375 8.7 (6.7, 12.4) 5.3 (4.8, 6.5)
neutrophils count 5 (3, 11) 375 11 (7, 15) 4 (3, 5)
Direct bilirubin 5 (3, 7) 375 7 (5, 12) 3 (3, 5)
Mean platelet volume 10.80 (10.20, 11.60) 375 11.30 (10.80, 12.30) 10.40 (9.90, 10.90)
ferritin 760 (621, 856) 375 760 (760, 1,425) 760 (417, 760)
RBC distribution width SD 41.2 (38.9, 44.6) 375 43.1 (40.4, 48.6) 40.3 (38.0, 42.1)
Thrombin time 16.55 (15.80, 17.40) 375 16.55 (16.12, 18.58) 16.55 (15.70, 16.90)
(%)lymphocyte 14 (5, 27) 375 4 (2, 9) 25 (18, 33)
HCV antibody quantification 0.06 (0.05, 0.08) 375 0.06 (0.05, 0.09) 0.06 (0.05, 0.07)
D-D dimer 1 (1, 9) 375 11 (2, 21) 1 (0, 1)
Total cholesterol 3.72 (2.99, 4.36) 375 3.20 (2.61, 3.72) 4.24 (3.69, 4.69)
aspartate aminotransferase 25 (19, 40) 375 40 (25, 64) 20 (16, 25)
Uric acid 260 (203, 342) 375 260 (190, 381) 260 (206, 325)
HCO3- 23.9 (21.0, 26.2) 375 21.2 (17.6, 23.9) 25.5 (23.8, 27.4)
calcium 2.11 (2.00, 2.22) 375 1.99 (1.89, 2.10) 2.20 (2.11, 2.29)
Amino-terminal brain natriuretic peptide precursor(NT-proBNP) 304 (123, 866) 375 921 (304, 3,666) 163 (29, 304)
Lactate dehydrogenase 274 (202, 598) 375 618 (437, 880) 204 (177, 245)
platelet large cell ratio 31 (26, 37) 375 35 (31, 42) 28 (24, 32)
Interleukin 6 18 (13, 24) 375 18 (18, 70) 18 (3, 18)
Fibrin degradation products 6 (5, 7) 375 7 (6, 150) 6 (4, 6)
monocytes count 0.43 (0.32, 0.60) 375 0.40 (0.21, 0.56) 0.47 (0.37, 0.61)
PLT distribution width 12.50 (11.10, 14.30) 375 13.60 (12.50, 15.95) 11.70 (10.70, 12.70)
globulin 32.4 (29.1, 35.7) 375 33.7 (31.2, 37.5) 31.2 (28.2, 33.4)
gamma_glutamyl_transpeptidase 33 (22, 52) 375 38 (27, 72) 29 (18, 44)
International standard ratio 1.10 (1.02, 1.29) 375 1.30 (1.15, 1.54) 1.04 (0.98, 1.09)
basophil count(#) 0.020 (0.010, 0.030) 375 0.020 (0.010, 0.030) 0.020 (0.010, 0.030)
2019-nCoV nucleic acid detection 375
-1 375 (100%) 174 (100%) 201 (100%)
mean corpuscular hemoglobin 30.90 (29.80, 32.20) 375 30.90 (30.00, 32.27) 30.90 (29.60, 32.10)
Activation of partial thromboplastin time 39 (37, 43) 375 39 (37, 44) 39 (36, 41)
High sensitivity C-reactive protein 26 (2, 96) 375 97 (57, 177) 2 (1, 13)
HIV antibody quantification 0.09 (0.08, 0.10) 375 0.09 (0.07, 0.10) 0.09 (0.08, 0.10)
serum sodium 141 (138, 143) 375 141 (138, 147) 141 (139, 142)
thrombocytocrit 0.21 (0.15, 0.27) 375 0.16 (0.10, 0.21) 0.25 (0.21, 0.31)
ESR 28 (16, 40) 375 28 (20, 46) 28 (14, 36)
glutamic-pyruvic transaminase 26 (17, 41) 375 27 (19, 46) 25 (16, 37)
eGFR 89 (68, 104) 375 70 (41, 91) 98 (86, 112)
creatinine 74 (58, 97) 375 88 (66, 148) 66 (55, 83)

¹ Statistics presented: Median (IQR); n (%)

With a clean dataset, it is possible to start the analysis.

Main Part

Plots including data about gender, age and hospitalization length

First, we can check the number of patients by gender.

gender_bar_plot <- ggplot(patients_df, aes(x = gender, fill = gender)) +
                geom_bar() +
                theme_bw() + 
                labs(title = "Number of patients divided into gender",
                    x = "Gender",
                    y = "Number of patients",
                fill = "Gender")
ggsave(filename = "gender_bar_plot.svg",
       plot = gender_bar_plot,
       device = "svg",
       path = "plots")
ggplotly(gender_bar_plot)

Below is a chart showing the ages of the patients.

gender_age_hist <- ggplot(patients_df, aes(x = age, fill=gender)) +
                    geom_histogram(binwidth = 5.0) +
                    theme_bw() +
                    labs(title = "Histogram of patients age and genders.",
                         x = "Age",
                         y = "Number of patients",
                         fill = "Gender") +
                    scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggsave(filename = "gender_age_plot.svg",
       plot = gender_age_hist,
       device = "svg",
       path = "plots")
ggplotly(gender_age_hist)

It is also possible to create a separate histogram for each gender.

gender_age_histograms <- ggplot(patients_df, aes(x = age, fill=gender)) +
                          geom_histogram(binwidth = 5.0) +
                          theme_bw() +
                          facet_wrap(~gender) +
                          labs(title = "Histogram of patients age and genders.",
                               x = "Age",
                               y = "Number of patients",
                               fill = "Gender") +
                          scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggsave(filename = "gender_age_histograms.svg",
       plot = gender_age_histograms,
       device = "svg",
       path = "plots")

ggplotly(gender_age_histograms)

Let us show histograms of outcome versus hospitalization length. First, a histogram including both genders.

hosp_length_hist <- ggplot(patients_df, aes(x = hospitalization_length, fill = gender)) + 
                      geom_histogram(binwidth = 1.0) +
                      theme_bw() +
                      scale_x_continuous(breaks=seq(0, max(patients_df$hospitalization_length), 1)) +
                      labs(title = "Number of patients and their hospitalization length",
                           x = "Hospitalization length (in days)",
                           y = "Number of patients", 
                           fill= "Gender")
ggsave(filename = "hospitalization_length_histograms.svg",
       plot = hosp_length_hist,
       device = "svg",
       path = "plots")

ggplotly(hosp_length_hist)

The next chart is a histogram that includes information about whether the patient survived or died.

hosp_length_outcome_hist <- ggplot(patients_df, aes(x = hospitalization_length,
                                                    fill = outcome)) +
                              geom_histogram(binwidth = 1.0) +
                              theme_bw() +
                              scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 1)) +
                              labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
                                   x = "Hospitalization length (in days)",
                                   y = "Number of patients",
                                   fill = "Outcome")

ggsave(filename = "hospitalization_length_outcome_histogram.svg",
       plot = hosp_length_outcome_hist,
       device = "svg",
       path = "plots")


ggplotly(hosp_length_outcome_hist)

Next, histograms with all the information above, faceted by gender and outcome.

hosp_length_outcomes_hists <- ggplot(patients_df, aes(x = hospitalization_length,
                                                    fill = outcome)) +
                              geom_histogram(binwidth = 1.0) +
                              theme_bw() +
                              facet_grid(gender~outcome) +
                              scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 5)) +
                              scale_y_continuous(breaks = seq(0, 20, 4)) +
                              labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
                                   x = "Hospitalization length (in days)",
                                   y = "Number of patients",
                                   fill = "Outcome")

ggsave(filename = "hospitalization_length_outcomes_histograms.svg",
       plot = hosp_length_outcomes_hists,
       device = "svg",
       path = "plots")
## Saving 10 x 5 in image
ggplotly(hosp_length_outcomes_hists)

Animated plots

The following charts are animated. The first shows the number of patient deaths on consecutive days.

patients_death <- patients_df %>%
                    select(discharge_time, outcome) %>%
                    filter(outcome == "died") %>%
                    mutate(discharge_time = as.integer(difftime(discharge_time,
                                                        min(patients_df$discharge_time),
                                                        units="days"))) %>%
                    group_by(discharge_time) %>%
                    summarise(num_of_deaths = n(), .groups="drop")

animated_deaths_plot <- ggplot(patients_death, aes(x = discharge_time,
                                            y = num_of_deaths)) +
                  geom_line(color = "blue",
                            size=1.5) +
                  theme_bw() +
                  labs(title = "Number of deaths on consecutive days (from first recorded discharge)",
                       x = "Days since first recorded discharge",
                       y = "Number of deaths") +
                  scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
                  scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths), 1)) +
                  transition_reveal(discharge_time)

anim_save("plots/deaths.gif", animated_deaths_plot)

The chart below shows the cumulative number of deaths over the following days.

patients_death <- patients_death %>%
                    arrange(discharge_time) %>%
                    mutate(num_of_deaths_agg = cumsum(num_of_deaths))

animated_deaths_agg_plot <- ggplot(patients_death,
                                   aes(x = discharge_time,
                                       y = num_of_deaths_agg)) +
                              geom_line(color = "blue",
                                        size = 1.5) +
                              theme_bw() +
                              labs(title = "Cumulative number of deaths on consecutive days (from first recorded discharge)",
                                   x = "Days since first recorded discharge",
                                   y = "Number of deaths (cumulative)") +
                              scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
                              scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths_agg), 10)) +
                              transition_reveal(discharge_time)

anim_save("plots/deaths_agg.gif", animated_deaths_agg_plot)

Machine Learning Model

In the following chapter, an attempt was made to create a model that will determine the probability of death from COVID-19 based on the observations and the results of the blood tests.

It was assumed that each row is a separate observation of a unique patient. First, the right columns have to be extracted from the dataset. The column containing the patient ID is not needed, and neither are the columns with the test dates and the admission/discharge times (we do not know in advance how long a patient will be hospitalized). We also need to shuffle the rows of the data frame: the first patient who died appears only at position 202, and training cannot start before both classes occur in the data. To make the shuffling reproducible, a seed is set.

ml_df <- data_df %>%
          select(-c("PATIENT_ID", "RE_DATE", "discharge_time","admission_time"))
set.seed(17)
rows <- sample(nrow(ml_df))
ml_df <- ml_df[rows,]

After preparing the dataset, it is possible to split the data into two sets:
  • training set - the model learns on this data
  • testing set - the trained model is evaluated on this data

Repeated cross-validation was chosen as the training procedure. This means the training process is repeated several times (the repeats argument of the trainControl function). Each time, the training set is split into folds: the model is trained on one part and then validated on the held-out part.
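The splitting described above can be sketched in base R. This is only an illustration of what caret's `trainControl(method = "repeatedcv", number = 2, repeats = 5)` does internally; the helper name `repeated_cv_folds` is made up for this sketch.

```r
# Sketch of repeated k-fold CV index generation (base R only).
repeated_cv_folds <- function(n, number = 2, repeats = 5, seed = 17) {
  set.seed(seed)
  folds <- list()
  for (r in seq_len(repeats)) {
    # randomly assign every row to one of `number` folds
    assignment <- sample(rep(seq_len(number), length.out = n))
    for (f in seq_len(number)) {
      # rows NOT in fold f form the training part; fold f is the validation part
      folds[[sprintf("Rep%d.Fold%d", r, f)]] <- which(assignment != f)
    }
  }
  folds
}

folds <- repeated_cv_folds(n = 10)
length(folds)  # 2 folds x 5 repeats = 10 training index sets
```

Each element of `folds` is one set of training indices; the rows left out form the corresponding validation set.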

inTraining <- createDataPartition(y = ml_df$outcome, p=.70, list=FALSE)
training <- ml_df[inTraining,]
testing <- ml_df[-inTraining,]

ctrl <- trainControl(method="repeatedcv",
                     number = 2,
                     repeats = 5,
                     classProbs = TRUE)

The following subchapters examine several machine learning methods and compare their accuracy, performance and errors.

k Nearest Neighbours

The k Nearest Neighbours classifier belongs to the group of instance-based classifiers. Classification happens online, when a new case needs to be classified; such methods are generally called lazy learning methods. The idea behind the k-NN algorithm is as follows: when a new object X needs to be classified, the classifier searches the pattern space for the k objects of the training set closest to X, and X is assigned, by majority vote, to the class that dominates among its k nearest neighbours.
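The idea can be shown with a minimal, hand-rolled sketch on toy data (base R only). The report itself uses caret's "knn" method below; the function `knn_predict` here is purely illustrative.

```r
# Toy k-NN: Euclidean distance + majority vote (illustration only).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # distance from the new object to every training object
  diffs <- sweep(train_x, 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # classes of the k nearest neighbours
  neighbours <- train_y[order(dists)[1:k]]
  # majority vote
  names(which.max(table(neighbours)))
}

train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
train_y <- c("survived", "survived", "survived", "died", "died", "died")
knn_predict(train_x, train_y, c(8.5, 8.5), k = 3)  # "died"
```

The new point's three nearest neighbours all belong to the "died" cluster, so the vote is unanimous.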

At the beginning, the seed is set. This ensures the repeatability of the experiments.

set.seed(17)

Then it is possible to start the training process with the k-NN method. The parameter k is set to 10.

tGrid <- expand.grid(k = 10)
knn_fit <- train(outcome ~ .,
                data = training,
                method = "knn",
                trControl = ctrl,
                tuneGrid = tGrid)

After the training process, the model needs to be tested on the testing set. Two types of results can be obtained:
  • samples classified into classes - needed to create the confusion matrix
  • class membership probabilities - needed to create the ROC curve

In the following lines, both types of results are obtained.

knnClasses_prob <- predict(knn_fit,
                          newdata = testing,
                          type = "prob")

knnClasses <- predict(knn_fit,
                     newdata = testing)

With the classified samples, it is possible to create a confusion matrix. The confusion matrix is an m x m matrix, where m is the number of class labels; in caret's output, the rows correspond to the class labels assigned by the classifier and the columns to the actual class labels of the test records. The caret library provides a function that returns the confusion matrix together with basic performance statistics.
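A minimal base-R illustration of how such a matrix is formed: cross-tabulate predicted labels against actual ones. The five toy labels here are made up for the example.

```r
# Build a confusion matrix by cross-tabulation (illustration only).
actual    <- factor(c("died", "died", "survived", "survived", "survived"))
predicted <- factor(c("died", "survived", "survived", "survived", "died"),
                    levels = levels(actual))
conf_mat <- table(Prediction = predicted, Reference = actual)
conf_mat
#           Reference
# Prediction died survived
#   died        1        1
#   survived    1        2

# accuracy = correctly classified / all samples
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # 0.6
```

caret::confusionMatrix does the same tabulation and additionally derives statistics such as sensitivity and specificity from it.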

knn_confMatrix <- caret::confusionMatrix(data = knnClasses, testing$outcome)
knn_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       48        7
##   survived    4       53
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8311, 0.9499)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8033          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 0.8833          
##          Pos Pred Value : 0.8727          
##          Neg Pred Value : 0.9298          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4286          
##    Detection Prevalence : 0.4911          
##       Balanced Accuracy : 0.9032          
##                                           
##        'Positive' Class : died            
## 

You can see the basic performance statistics. The model is not perfect, but its predictions are quite good. Confusion matrix analysis is one of many ways to assess the accuracy of a model. Another is the ROC curve and the area under it (AUC). The ROC curve is a performance measure for classification problems across various threshold settings, and the AUC is a measure of separability: it tells how capable the model is of distinguishing between the classes. The higher the AUC, the better the model is at predicting survivors as survived and the deceased as died.
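The AUC has a simple probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, which can be computed from ranks (the Mann-Whitney U statistic). The report uses pROC for the real computation; this base-R sketch with made-up scores is only to show what the number means.

```r
# AUC via the rank (Mann-Whitney) formula (illustration only).
auc_rank <- function(is_positive, scores) {
  r <- rank(scores)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)   # TRUE = "died"
scores <- c(0.90, 0.80, 0.70, 0.60, 0.30, 0.10)      # predicted P(died)
auc_rank(labels, scores)  # 8/9: one negative case outranks one positive case
```

An AUC of 1 means every positive outranks every negative; 0.5 means the scores carry no class information.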

knn_ROC <- roc(response = testing$outcome,
           predictor = knnClasses_prob[, "died"],
           levels = rev(levels(testing$outcome)))

plot(knn_ROC,
     print.auc = TRUE,
     col="black",
     lwd = 5)

legend("bottomright",
       col=c(par("fg"), "blue"),
       legend = c("Empirical"),
       lwd=5)

The AUC value is quite high. Now we need to check whether the model is overfitted.

Overfitting is an over-matching to the specifics of the training data and is most often associated with a loss of generalizability. To check whether the model is overfitted, a dedicated function was created to compute learning curves (similar functions were created for the other methods as well).

my_learning_curve_knn <- function(training_data,
                              test_data,
                              outcome = NULL,
                              trControl = NULL)
{
  train_len <- nrow(training_data)
  if(train_len < 2)
    stop("Training data must have at least 2 data points.")
  if(nrow(test_data) < 1)
    stop("Test data must have at least 1 data points")
  if(is.null(outcome))
    stop("Please give a character string for the outcome column name.")
  if(is.null(trControl))
    stop("Please put the train controller.")
  learnCurve <- data.frame(m = integer(train_len),
                           RMSE_type = character(train_len),
                           RMSE_value = numeric(train_len))

  j <- 0
  tGrid <- expand.grid(k = 10)
  for (i in c(seq(5,train_len, 20), train_len))
  {
    j <- j + 1
    tr <- training_data[1:i,]
    set.seed(17)
    
    trainModel <- train(outcome ~ .,
                    data = tr,
                    method = "knn",
                    trControl = trControl,
                    tuneGrid = tGrid)
    
  
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Training Error"
    learnCurve$RMSE_value[j] <- rmse(tr$outcome,
                                    predict(trainModel,
                                            newdata = tr))
    j <- j + 1
    
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Testing Error"
    learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
                                 predict(trainModel,
                                         newdata = test_data))
  }
  
  learnCurve[1:j,]
}

This function generates learning curves for the training and testing sets of the model. The chart below shows both curves.

knn_lmc <- my_learning_curve_knn(training_data = training,
                  test_data = testing,
                  outcome = "outcome",
                  trControl = ctrl)

knn_learning_curves <- ggplot(knn_lmc, aes(x = m)) +
                          geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) + 
                          theme_bw() +
                          labs(title = "Learning Curves",
                               x = "Examples",
                               y = "RMSE value",
                               color = "Learning Curve")
ggsave(filename = "k_nn_learning_curves.svg",
       plot = knn_learning_curves,
       device = "svg",
       path = "plots")

ggplotly(knn_learning_curves)

The graph shows the learning curves. The X axis gives the number of examples included in the training process; the Y axis gives the root mean square error. The error on the training set is similar to the error on the testing set and neither is large, so we can assume the model is not overfitted.

SVM

SVM is a classification method whose training objective is to determine the separating hyperplane with the maximum margin between examples of the two classes.
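A toy base-R illustration of the margin concept: for a hyperplane w·x + b = 0 and a point x with label y in {-1, +1}, the signed margin is y(w·x + b)/||w||. SVM training chooses w and b to maximize the smallest such margin; the names below are illustrative and not part of the report's caret pipeline.

```r
# Signed margin of a labeled point with respect to a hyperplane (illustration only).
margin <- function(w, b, x, y) {
  y * (sum(w * x) + b) / sqrt(sum(w^2))
}

w <- c(1, 1); b <- -3            # hyperplane x1 + x2 = 3
margin(w, b, c(1, 1), y = -1)    # below the plane, correctly classified: positive margin
margin(w, b, c(3, 2), y = +1)    # above the plane, correctly classified: larger margin
margin(w, b, c(2, 2), y = -1)    # misclassified point: negative margin
```

The svmRadial variant used below applies the same principle after mapping the data through a radial basis kernel.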

At the beginning, the seed is set. This ensures the repeatability of the experiments.

set.seed(17)

Then it is possible to start the training process with the SVM method. We choose svmRadial: the kernel used in training and prediction is a radial basis function.

svm_fit <- train(outcome ~ .,
              data = training,
              method = "svmRadial",
              trControl = ctrl)

As with the k-NN method, both class predictions and class probabilities are computed.

svmClasses <- predict(svm_fit,
                      newdata = testing)
svmClasses_prob <- predict(svm_fit,
                           newdata = testing,
                           type = "prob")

Below is the confusion matrix for the SVM model.

svm_confMatrix <-caret::confusionMatrix(testing$outcome,
                                        svmClasses)
svm_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       49        3
##   survived    4       56
##                                           
##                Accuracy : 0.9375          
##                  95% CI : (0.8755, 0.9745)
##     No Information Rate : 0.5268          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8745          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9245          
##             Specificity : 0.9492          
##          Pos Pred Value : 0.9423          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.4732          
##          Detection Rate : 0.4375          
##    Detection Prevalence : 0.4643          
##       Balanced Accuracy : 0.9368          
##                                           
##        'Positive' Class : died            
## 

The model made fewer errors than the k-NN model. Below is the ROC curve for the SVM model.

svm_ROC <- roc(response = testing$outcome,
               predictor = svmClasses_prob[, "died"],
               levels = rev(levels(testing$outcome)))

plot(svm_ROC,
     print.auc = TRUE,
     col="black",
     lwd = 5)

legend("bottomright",
       col=c(par("fg"), "blue"),
       legend = c("Empirical"),
       lwd = 5)

The AUC is high, as in the k-NN model. Below is the function that creates learning curves for the SVM method.

my_learning_curve_svm <- function(training_data,
                              test_data,
                              outcome = NULL,
                              trControl = NULL)
{
  train_len <- nrow(training_data)
  if(train_len < 2)
    stop("Training data must have at least 2 data points.")
  if(nrow(test_data) < 1)
    stop("Test data must have at least 1 data points")
  if(is.null(outcome))
    stop("Please give a character string for the outcome column name.")
  if(is.null(trControl))
    stop("Please put the train controller.")
  learnCurve <- data.frame(m = integer(train_len),
                           RMSE_type = character(train_len),
                           RMSE_value = numeric(train_len))

  j <- 0
  for (i in c(seq(20,train_len, 20), train_len))
  {
    j <- j + 1
    tr <- training_data[1:i,]
    set.seed(17)
    
    trainModel <- train(outcome ~ .,
                    data = tr,
                    method = "svmRadial",
                    trControl = trControl)
    
  
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Training Error"
    learnCurve$RMSE_value[j] <- rmse(tr$outcome,
                                    predict(trainModel,
                                            newdata = tr))
    j <- j + 1
    
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Testing Error"
    learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
                                 predict(trainModel,
                                         newdata = test_data))
  }
  
  learnCurve[1:j,]
}

This function generates learning curves for the training and testing sets of the SVM model. The chart below shows both curves.

svm_lmc <- my_learning_curve_svm(training_data = training,
                                 test_data = testing,
                                 outcome = "outcome",
                                 trControl = ctrl)

svm_learning_curves <- ggplot(svm_lmc, aes(x = m)) +
                        geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) + 
                        theme_bw() +
                        labs(title = "Learning Curves",
                             x = "Examples",
                             y = "RMSE value",
                             color = "Learning Curve")

ggsave(filename = "svm_learning_curves.svg",
       plot = svm_learning_curves,
       device = "svg",
       path = "plots")

ggplotly(svm_learning_curves)

The error on the training set is lower than on the testing set. Both errors are lower than those of the k-NN method, which suggests the SVM model is better than k-NN.

Random Forests

Random Forests belong to the category of ensemble classifiers built by modifying the set of attributes of the training data set; the method uses decision trees as its only base classifiers.

At the beginning, the seed is set. This ensures the repeatability of the experiments.

set.seed(17)

Then it is possible to start the training process with the Random Forests method. The number of trees in the forest is set to 10.

rf_fit <- train(outcome ~ .,
                data = training,
                method = "rf",
                trControl = ctrl,
                ntree=10)

As before, both class predictions and class probabilities are computed.

rfClasses_prob <- predict(rf_fit,
                          newdata = testing,
                          type = "prob")

rfClasses <- predict(rf_fit,
                     newdata = testing)

importance_vector <- as.vector(rf_fit$finalModel$importance)

Below is the confusion matrix for the Random Forests model.

rf_confMatrix <- caret::confusionMatrix(data = rfClasses, testing$outcome)
rf_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       51        2
##   survived    1       58
##                                           
##                Accuracy : 0.9732          
##                  95% CI : (0.9237, 0.9944)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9462          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9808          
##             Specificity : 0.9667          
##          Pos Pred Value : 0.9623          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4554          
##    Detection Prevalence : 0.4732          
##       Balanced Accuracy : 0.9737          
##                                           
##        'Positive' Class : died            
## 

This model made the fewest mistakes compared to the previous models. Below is the ROC curve for the Random Forests model.

rf_ROC <- roc(response = testing$outcome,
           predictor = rfClasses_prob[, "died"],
           levels = rev(levels(testing$outcome)))

plot(rf_ROC,
     print.auc = TRUE,
     col="black",
     lwd = 5)

legend("bottomright",
       col=c(par("fg"), "blue"),
       legend = c("Empirical"),
       lwd = 5)

The AUC is, as with the previous models, high. Below is the function that creates the learning curves.

my_learning_curve_rf <- function(training_data,
                                test_data,
                                outcome = NULL,
                                trControl = NULL)
{
  train_len <- nrow(training_data)
  if(train_len < 2)
    stop("Training data must have at least 2 data points.")
  if(nrow(test_data) < 1)
    stop("Test data must have at least 1 data points")
  if(is.null(outcome))
    stop("Please give a character string for the outcome column name.")
  if(is.null(trControl))
    stop("Please put the train controller.")
  learnCurve <- data.frame(m = integer(train_len),
                           RMSE_type = character(train_len),
                           RMSE_value = numeric(train_len))
  j <- 0
  
  
  for (i in c(seq(5, train_len, 20), train_len))
  {
    j <- j + 1
    tr <- training_data[1:i,]
    set.seed(17)
    
    trainModel <- train(outcome ~ .,
                    data = tr,
                    method = "rf",
                    trControl = trControl,
                    ntree = 10)
    
  
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Training Error"
    learnCurve$RMSE_value[j] <- rmse(tr$outcome,
                                    predict(trainModel,
                                            newdata = tr))
    j <- j + 1
    
    learnCurve$m[j] <- i
    learnCurve$RMSE_type[j] <- "Testing Error"
    learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
                                 predict(trainModel,
                                         newdata = test_data))
  }
  
  learnCurve[1:j,]
}

This function generates learning curves for the training and testing sets of the Random Forests model. The chart below shows both curves.

rf_mlc <- my_learning_curve_rf(training_data = training,
                               test_data = testing,
                               outcome = "outcome",
                               trControl = ctrl)

rf_learning_curves <- ggplot(rf_mlc, aes(x = m)) +
                        geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) + 
                        theme_bw() +
                        labs(title = "Learning Curves",
                             x = "Examples",
                             y = "RMSE value",
                             color = "Learning Curve")

ggsave(filename = "rf_learning_curves.svg",
       plot = rf_learning_curves,
       device = "svg",
       path = "plots")

ggplotly(rf_learning_curves)

The error during the training process is lower than the error during the testing process, and both errors are quite low.

Comparison

We created three prediction models based on three different learning methods. Below is a comparison of them.

First, we can compare the confusion matrices of the models. Below is the confusion matrix for the k-NN method.

knn_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       48        7
##   survived    4       53
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8311, 0.9499)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8033          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 0.8833          
##          Pos Pred Value : 0.8727          
##          Neg Pred Value : 0.9298          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4286          
##    Detection Prevalence : 0.4911          
##       Balanced Accuracy : 0.9032          
##                                           
##        'Positive' Class : died            
## 

Below is the confusion matrix for the SVM method.

svm_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       49        3
##   survived    4       56
##                                           
##                Accuracy : 0.9375          
##                  95% CI : (0.8755, 0.9745)
##     No Information Rate : 0.5268          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8745          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9245          
##             Specificity : 0.9492          
##          Pos Pred Value : 0.9423          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.4732          
##          Detection Rate : 0.4375          
##    Detection Prevalence : 0.4643          
##       Balanced Accuracy : 0.9368          
##                                           
##        'Positive' Class : died            
## 

Below is the confusion matrix for the Random Forests method.

rf_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       51        2
##   survived    1       58
##                                           
##                Accuracy : 0.9732          
##                  95% CI : (0.9237, 0.9944)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9462          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9808          
##             Specificity : 0.9667          
##          Pos Pred Value : 0.9623          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4554          
##    Detection Prevalence : 0.4732          
##       Balanced Accuracy : 0.9737          
##                                           
##        'Positive' Class : died            
## 

We can see that the best confusion matrix belongs to the Random Forests model, which made the fewest mistakes.

Let's check the ROC curves. Below is a chart with the ROC curves of all three models.

plot(knn_ROC,
     col="red",
     lwd = 5)
plot(svm_ROC,
     col = "green",
     add = TRUE,
     lwd = 5)
plot(rf_ROC,
     col = "blue",
     add = TRUE,
     lwd = 5)
legend("bottomright",
       col=c("red", "green", "blue"),
       legend = c("k-NN", "SVM", "Random Forests"),
       lwd = 5)

The chart above shows that the ROC curve is most convex for the Random Forests method.

Let's compare the learning curves. First, we need to prepare data frames with the training results.

knn_train_lmc <- knn_lmc %>%
                  filter(RMSE_type == "Training Error") %>%
                  mutate(method = "k-NN")
knn_test_lmc <- knn_lmc %>%
                  filter(RMSE_type == "Testing Error") %>%
                  mutate(method = "k-NN")

svm_train_lmc <- svm_lmc %>%
                  filter(RMSE_type == "Training Error") %>%
                  mutate(method = "SVM")
svm_test_lmc <- svm_lmc %>%
                  filter(RMSE_type == "Testing Error") %>%
                  mutate(method = "SVM")

rf_train_lmc <- rf_mlc %>%
                  filter(RMSE_type == "Training Error") %>%
                  mutate(method = "Random Forests")
rf_test_lmc <- rf_mlc %>%
                  filter(RMSE_type == "Testing Error") %>%
                  mutate(method = "Random Forests")

training_curves <- rbind(knn_train_lmc, svm_train_lmc, rf_train_lmc)


testing_curves <- rbind(knn_test_lmc, svm_test_lmc, rf_test_lmc)

After that, it is possible to create the plots and compare them. First, we can compare the learning curves for the training set.

comp_training_curves <- ggplot(training_curves, aes(x = m)) +
                          geom_line(aes(y = RMSE_value, color = method), size=2.0) + 
                          theme_bw() +
                          labs(title = "Learning Curves for training",
                               x = "Examples",
                               y = "RMSE value",
                               color = "Method of learning")

ggsave(filename = "comparison_training_curves.svg",
       plot = comp_training_curves,
       device = "svg",
       path = "plots")

ggplotly(comp_training_curves)

We can see that the smallest training error is made by the model trained with Random Forests. Below is the comparison of the learning curves during the testing process.

comp_testing_curves <- ggplot(testing_curves, aes(x = m)) +
                        geom_line(aes(y = RMSE_value, color = method), size=2.0) + 
                        theme_bw() +
                        labs(title = "Learning Curves for testing",
                             x = "Examples",
                             y = "RMSE value",
                             color = "Method of learning")

ggsave(filename = "comparison_testing_curves.svg",
       plot = comp_testing_curves,
       device = "svg",
       path = "plots")

ggplotly(comp_testing_curves)

As seen before, the smallest error is achieved by the model trained with Random Forests. We can therefore assume that the best prediction model is the Random Forests model. In the following chapter, there will be an attempt to improve this model by tuning.

Tuning the best model

To improve our model, we can reduce the number of attributes used in training. Let’s check which attributes are the most important in our best model. The importance of an attribute can be judged on the basis of the mean decrease in the Gini index. Below is a table with the attributes sorted by importance for the best model.
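As a cross-check, caret exposes the same ranking through varImp(). A short sketch, assuming the rf_fit model trained earlier:

```r
# Variable importance as reported by caret (scaled to 0-100 by default).
rf_importance <- varImp(rf_fit)
head(rf_importance$importance)
```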

attributes <- colnames(select(ml_df, -"outcome"))
importance_vector <- as.vector(rf_fit$finalModel$importance)

importance_df <- data.frame(matrix(c(attributes, importance_vector), ncol = 2))
colnames(importance_df) <- c("Attribute", "Importance")
importance_df <- importance_df %>%
                  arrange(desc(as.numeric(Importance)))
kable(importance_df)
Attribute Importance
High sensitivity C-reactive protein 44.1607344234113
Lactate dehydrogenase 18.0836587727037
D-D dimer 14.7009606239862
neutrophils(%) 13.3725297874191
International standard ratio 7.92895531755091
(%)lymphocyte 7.80154653278904
Platelet count 5.82501188186179
age 4.69779397159593
procalcitonin 3.59753869309935
eosinophils(%) 1.85726499671876
thrombocytocrit 1.61405463249789
aspartate aminotransferase 0.988070175438596
lymphocyte count 0.772881355932203
monocytes(%) 0.618302277432711
ferritin 0.494117647058824
Serum chloride 0.46
fibrinogen 0.4
Activation of partial thromboplastin time 0.392
Eosinophil count 0.372649572649573
Tumor_necrosis_factor_alfa 0.27
indirect bilirubin 0.198373983739837
serum sodium 0.198373983739837
basophil count(#) 0.1875
Red blood cell distribution width 0.18
hematocrit 0.18
PLT distribution width 0.171428571428571
glucose 0.166666666666667
mean corpuscular hemoglobin 0.166666666666667
Red blood cell count 0.1
gender 0.0962315462315462
Total cholesterol 0.0346320346320347
glutamic-pyruvic transaminase 0.0309597523219814
PH value 0.0272727272727272
Hypersensitive cardiac troponinI 0
hemoglobin 0
Prothrombin time 0
Interleukin 2 receptor 0
Alkaline phosphatase 0
albumin 0
basophil(%) 0
Interleukin 10 0
Total bilirubin 0
antithrombin 0
Interleukin 8 0
total protein 0
Quantification of Treponema pallidum antibodies 0
Prothrombin activity 0
HBsAg 0
mean corpuscular volume 0
White blood cell count 0
mean corpuscular hemoglobin concentration 0
Interleukin_1_Beta 0
Urea 0
Corrected calcium 0
Serum potassium 0
neutrophils count 0
Direct bilirubin 0
Mean platelet volume 0
RBC distribution width SD 0
Thrombin time 0
HCV antibody quantification 0
Uric acid 0
HCO3- 0
calcium 0
Amino-terminal brain natriuretic peptide precursor(NT-proBNP) 0
platelet large cell ratio 0
Interleukin 6 0
Fibrin degradation products 0
monocytes count 0
globulin 0
gamma_glutamyl_transpeptidase 0
2019-nCoV nucleic acid detection 0
HIV antibody quantification 0
ESR 0
eGFR 0
creatinine 0

An attempt was made to limit the model to the attributes whose importance is higher than 0. These most important attributes have been selected from the list above.

most_important_attr_rf <- importance_df %>%
                            filter(Importance > 0)

most_important_attr_rf <- most_important_attr_rf[, 1]

new_training <- training %>%
              select(c("outcome", all_of(most_important_attr_rf)))

new_testing <- testing %>%
            select(c("outcome", all_of(most_important_attr_rf)))

Below, the model is trained again in the same way as above.

First, the seed must be set.

set.seed(17)

Now, the training process with the new training data.

new_rf_fit <- train(outcome ~ .,
                data = new_training,
                method = "rf",
                trControl = ctrl,
                ntree=10)

With the tuned model, we can predict classes. As previously, both types of prediction will be generated.

new_rfClasses_prob <- predict(new_rf_fit,
                          newdata = new_testing,
                          type = "prob")

new_rfClasses <- predict(new_rf_fit,
                     newdata = new_testing)

Below is the confusion matrix for the tuned model.

new_rf_confMatrix <- caret::confusionMatrix(data = new_rfClasses, new_testing$outcome)
new_rf_confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       52        4
##   survived    0       56
##                                           
##                Accuracy : 0.9643          
##                  95% CI : (0.9111, 0.9902)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9286          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4643          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9667          
##                                           
##        'Positive' Class : died            
## 

We can see that the tuned model made more mistakes than the original model, but it made no false negatives - errors in which a patient who dies is classified as a survivor. All of its mistakes are false positives, where a patient who survives is classified as one who will die. If we can choose between the two, the false positive is the more acceptable type of error in terms of prediction.
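Both error types can be read straight from the confusion-matrix table, where rows are predictions and columns are the reference values. A sketch, assuming the new_rf_confMatrix object created above:

```r
# rows = Prediction, columns = Reference (positive class: "died")
false_negatives <- new_rf_confMatrix$table["survived", "died"]
false_positives <- new_rf_confMatrix$table["died", "survived"]
c(FN = false_negatives, FP = false_positives)
```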

Below is the ROC curve for the tuned model.

new_rf_ROC <- roc(response = new_testing$outcome,
           predictor = new_rfClasses_prob[, "died"],
           levels = rev(levels(new_testing$outcome)))

plot(new_rf_ROC,
     print.auc = TRUE,
     col="black",
     lwd = 5)

legend("bottomright",
       col=c(par("fg"), "blue"),
       legend = c("Empirical"),
       lwd = 5)

The area under the curve (AUC) is larger than in the original model. That means the tuned model is more efficient than the original.

Let’s check the learning curve for the tuned model. The original function created for Random Forests will be used.

new_rf_mlc <- my_learning_curve_rf(training_data = new_training,
                                   test_data = new_testing,
                                   outcome = "outcome",
                                   trControl = ctrl)

tuned_model_learning_curves_1 <- ggplot(new_rf_mlc, aes(x = m)) +
                                  geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) + 
                                  theme_bw() +
                                  labs(title = "Learning Curves",
                                       x = "Examples",
                                       y = "RMSE value",
                                       color = "Learning Curve")

ggsave(filename = "tuned_model_learning_curves_1.svg",
       plot = tuned_model_learning_curves_1,
       device = "svg",
       path = "plots")

ggplotly(tuned_model_learning_curves_1)

The chart above, with the learning curves, does not differ much from the original. The testing error is on the same level; only the training error has increased, but it is still acceptable. Let’s check whether it is possible to tune the model further. The whole process must be repeated.

First, the extraction of most important attributes.

tuned_attr <- colnames(select(new_training, -"outcome"))
importance_vector2 <- as.vector(new_rf_fit$finalModel$importance)
importance_df2 <- data.frame(matrix(c(tuned_attr, importance_vector2), ncol = 2))
colnames(importance_df2) <- c("Attribute", "Importance")
importance_df2 <- importance_df2 %>%
                    arrange(desc(as.numeric(Importance)))
kable(importance_df2)
Attribute Importance
Lactate dehydrogenase 34.7140969637687
lymphocyte count 11.9829323741753
D-D dimer 11.5085680462158
High sensitivity C-reactive protein 7.03095580994492
monocytes(%) 6.56151731684465
thrombocytocrit 6.36876513255838
Serum chloride 5.7184954140937
PLT distribution width 5.43850297906969
eosinophils(%) 4.99905346238046
aspartate aminotransferase 4.72104091184587
neutrophils(%) 3.02638249371786
glucose 2.78834686726925
age 2.7695633118024
Red blood cell distribution width 2.31061815531321
(%)lymphocyte 1.66170233663501
gender 1.56580213903743
ferritin 1.56060039381754
Total cholesterol 1.55755526790951
procalcitonin 1.41184022610888
Tumor_necrosis_factor_alfa 1.39327698630936
Activation of partial thromboplastin time 1.37429218400686
International standard ratio 1.23499571887807
PH value 1.10586534002125
serum sodium 0.839661089661089
hematocrit 0.814999140300108
Platelet count 0.7990459931305
Eosinophil count 0.77741190520095
fibrinogen 0.568508641600968
indirect bilirubin 0.375051247771836
mean corpuscular hemoglobin 0.301658085697764
Red blood cell count 0.231741911560211
basophil count(#) 0.211228991596638
glutamic-pyruvic transaminase 0.198854256854258

Let’s try to train the model using only the attributes whose importance is higher than 5. There are only 8 such attributes.

most_important_attr_rf2 <- importance_df2 %>%
                              filter(as.numeric(Importance) > 5) %>%
                              select(Attribute)
kable(most_important_attr_rf2)
Attribute
Lactate dehydrogenase
lymphocyte count
D-D dimer
High sensitivity C-reactive protein
monocytes(%)
thrombocytocrit
Serum chloride
PLT distribution width

Now the training and test sets should be built based on the selected attributes.

most_important_attr_rf2 <- most_important_attr_rf2$Attribute

new_training2 <- training %>%
              select(c("outcome", all_of(most_important_attr_rf2)))

new_testing2 <- testing %>%
            select(c("outcome", all_of(most_important_attr_rf2)))

Then, the model can be trained.

set.seed(17)
new_rf_fit2 <- train(outcome ~ .,
             data = new_training2,
             method = "rf",
             trControl = ctrl,
             ntree = 10)

With the trained model, it is possible to predict the results.

new_rfClasses2 <- predict(new_rf_fit2,
                          newdata = new_testing2)

new_rfClasses2_prob <- predict(new_rf_fit2,
                               newdata = new_testing2,
                               type = "prob")

Below is the confusion matrix for this tuned model.

new_rf_confMatrix2 <- caret::confusionMatrix(data = new_rfClasses2,
                                             new_testing2$outcome)
new_rf_confMatrix2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died survived
##   died       52        4
##   survived    0       56
##                                           
##                Accuracy : 0.9643          
##                  95% CI : (0.9111, 0.9902)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9286          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4643          
##          Detection Rate : 0.4643          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9667          
##                                           
##        'Positive' Class : died            
## 

We can see that there are no differences between this model and the previously tuned model. Let’s check the ROC curve and the AUC value.
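The equality of the two models can also be checked programmatically. A sketch, assuming the two confusion-matrix objects created above:

```r
# Overall accuracy of the first tuned model vs. the reduced 8-attribute model.
c(tuned   = unname(new_rf_confMatrix$overall["Accuracy"]),
  reduced = unname(new_rf_confMatrix2$overall["Accuracy"]))
```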

new_rf_ROC2 <- roc(response = new_testing2$outcome,
                   predictor = new_rfClasses2_prob[, "died"],
                   levels = rev(levels(new_testing2$outcome)))

plot(new_rf_ROC2,
     print.auc = TRUE,
     col="black",
     lwd = 5)
legend("bottomright",
       col=c(par("fg"), "blue"),
       legend = c("Empirical"),
       lwd = 5)

The graph above tells us that this model is not as good as the previous one. However, we have reduced the number of attributes from 33 to 8. It means that we do not need to know the values of all 33 attributes - only 8 attributes are needed to predict the result with the same accuracy. Below is the chart with the learning curves of the new tuned model.

new_rf_mlc2 <- my_learning_curve_rf(training_data = new_training2,
                                    test_data = new_testing2,
                                    outcome = "outcome",
                                    trControl = ctrl)

tuned_model_learning_curves_2 <- ggplot(new_rf_mlc2, aes(x = m)) +
                                  geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) + 
                                  theme_bw() +
                                  labs(title = "Learning Curves",
                                       x = "Examples",
                                       y = "RMSE value",
                                       color = "Learning Curve")

ggsave(filename = "tuned_model_learning_curves_2.svg",
       plot = tuned_model_learning_curves_2,
       device = "svg",
       path = "plots")

ggplotly(tuned_model_learning_curves_2)

The chart above shows that the training and testing errors are on the same level, and both values have decreased. It means that the next iteration of tuning gave us a better model.

Summary

In this document we can find an analysis of COVID-19 cases. After a short introduction, there is a part with a graphical presentation of the data used in the report. After the graphical part, there was an attempt to create a machine learning model predicting the mortality of patients. Such a model can be used, e.g. in hospitals, to judge which patients need to be hospitalized immediately. We extracted 8 attributes that are needed to predict, with high efficiency, the mortality of a patient. Below is the table with these attributes:

results <- data.frame(most_important_attr_rf2)
colnames(results) <- "Most important attributes"
kable(results)
Most important attributes
Lactate dehydrogenase
lymphocyte count
D-D dimer
High sensitivity C-reactive protein
monocytes(%)
thrombocytocrit
Serum chloride
PLT distribution width

These biological factors are significant for doctors predicting the course of COVID-19 cases. Some medical articles reach similar conclusions about test results, e.g. Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis, Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: A systematic review and meta-analysis or Thrombocytopenia is associated with severe coronavirus disease 2019 (COVID-19) infections: A meta-analysis.

tree_func <- function(final_model, 
                      tree_num) {
  # requires the randomForest, dplyr, tibble, igraph and ggraph packages
  #source: https://shiring.github.io/machine_learning/2017/03/16/rf_plot_ggraph
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # make leaf split points to NA, so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the last node that we don't want to plot
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  V(graph)$leaf_label <- as.character(tree$prediction)
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'dendrogram') + 
    theme_bw() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
    geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
                    repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.background = element_rect(fill = "white"),
          panel.border = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          plot.title = element_text(size = 18))
  
  print(plot)
}

Below is an example of a tree from the forest.

tree <- tree_func(final_model = new_rf_fit2$finalModel,
                  tree_num = 5)

The conclusions are as follows. The prediction model could be used to assess the state of health of patients. The model sufficiently reflects reality, but it only considers the results of blood tests. The patient’s symptoms and other data, such as age, gender and comorbidities, are not taken into account. First of all, the model should play a supporting role in diagnosing the course of the disease - it should not prejudge the treatment of the patient. It should be remembered that each case is different and an individual course of treatment should be selected for each one.

Bibliography

Aleksandra Chachowska, lek. Paweł Żmuda-Trzebiatowski lek. 2020. “Monocyty. Kiedy Warto Je Zbadać?” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,monocyty---norma--podwyzszone--za-wysokie--powyzej-normy-,artykul,1655749.html.

Aleksandra Witkowska. 2020a. “Albuminy.” https://www.medonet.pl/badania,albuminy,artykul,1570119.html.

Andrzej Dębski, lek. Paweł Żmuda-Trzebiatowski. 2020. “Trombocyty. Płytki Krwi, Zaburzenia Krzepnięcia I Inne Schorzenia.” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,plytki-krwi--trombocyty--plt---norma--niski-i-wysoki-poziom,artykul,1719033.html.

“Badanie Alt – Norma I Wyniki. Co Znaczy Wysokie Alt?” 2020. https://www.medme.pl/artykuly/badanie-alt-badanie-krwi-w-diagnozie-uszkodzen-watroby-norma,67611.html.

“Bazofile (Granulocyty Zasadochłonne) - Normy. Co Oznacza Wysoki Lub Niski Poziom Bazofili?” 2020. https://lekarzebezkolejki.pl/blog/bazofile-granulocyty-zasadochlonne-normy-co-oznacza-wysoki-lub-niski-poziom-bazofili/w-612.

“Bilirubina Bezpośrednia (Sprzężona) - Objawy, Leczenie I Badania.” 2020. https://www.doz.pl/zdrowie/h11103-Bilirubina_bezposrednia_sprzezona.

“Bilirubina Pośrednia Badanie.” 2020. https://www.synevo.pl/bilirubina-posrednia/.

Brandon Michael Henry, Stefanie Benoit, Maria Helena Santos de Oliveira. 2020. “Hematologic, Biochemical and Immune Biomarker Abnormalities Associated with Severe Illness and Mortality in Coronavirus Disease 2019 (Covid-19): A Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32286245/.

“D-Dimery: Norma, Badanie I Wyniki. Przyczyny Podwyższonych?” 2020. https://www.medme.pl/artykuly/d-dimery-norma-badanie-i-wyniki-przyczyny-podwyzszonych,73373.html.

“Eozynofil.” 2020. https://pl.wikipedia.org/wiki/Eozynofil.

“Erytrocyty (Czerwona Krwinki) - Jaka Jest Ich Prawidłowa Ilość?” 2020. https://lekarzebezkolejki.pl/blog/erytrocyty-czerwona-krwinki-jaka-jest-ich-prawidlowa-ilosc/w-473.

Ewa Talarek. 2020a. “Badanie Stężenia Potasu W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174203,badanie-stezenia-potasu-w-surowicy-krwi.

———. 2020b. “Badanie Stężenia Sodu W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174208,badanie-stezenia-sodu-w-surowicy-krwi.

———. 2020c. “Badanie Stężenia Wapnia W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174223,badanie-stezenia-wapnia-w-surowicy-krwi.

“Ferrytyna - Normy I Znaczenie Dla Zdrowia.” 2020. https://lekarzebezkolejki.pl/blog/ferrytyna-normy-i-znaczenie-dla-zdrowia/w-963.

“Fosfataza Alkaliczna (Alp).” 2020. https://www.medicover.pl/badania/fosfataza-alkaliczna/.

Giuseppe Lippi, Brandon Michael Henry, Mario Plebani. 2020. “Thrombocytopenia Is Associated with Severe Coronavirus Disease 2019 (Covid-19) Infections: A Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32178975/.

“HCV Rna Met. Real Time Rt- Pcr, Ilościowo.” 2020. https://diag.pl/sklep/badania/hcv-met-pcr-ilosciowo/.

“Hematokryt (Hct) – Normy. Co Oznacza Niski I Wysoki Poziom Hct?” 2020. https://lekarzebezkolejki.pl/blog/hematokryt-hct-normy-co-oznacza-niski-i-wysoki-poziom-hct/w-603.

“Hemoglobina - Jaki Jest Jej Prawidłowy Poziom?” 2020. https://lekarzebezkolejki.pl/blog/hemoglobina-jaki-jest-jej-prawidlowy-poziom/w-433.

“HIV-1 Rna Met. Real Time Rt-Pcr, Ilościowo.” 2020. https://diag.pl/sklep/badania/hiv-1-met-pcr-ilosciowo/.

“Interleukiny - Rodzaje, Funkcje, Rola W Powstawaniu Chorób Autoimmunologicznych.” 2020. https://centermed.pl/news/631,Interleukiny__rodzaje_funkcje_rola_w_powstawaniu_chorob_autoimmunologicznych.

Irina Bosek, Tomasz Miłek, Roman Kuczerowski. 2020. “Stężenie Interleukiny 2 I Interleukiny 10 U Pacjentów Z Cukrzycą Typu 2 I Rakiem Okrężnicy.” https://journals.viamedica.pl/diabetologia_praktyczna/article/download/58009/43758.

Janowska, Sara. 2020. “Limfocytoza - O Czym świadczy Zbyt Wysoki Poziom Limfocytów?” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/limfocytoza-o-czym-swiadczy-zbyt-wysoki-poziom-limfocytow-aa-K94G-LDUH-GUH2.html.

Jurewicz, Andrzej. 2020a. “Badanie Crp - Normy, Interpretacja Wyników. Podwyższone Crp a Diagnostyka Chorób.” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,crp-badanie---norma-badania-krwi--podwyzszone--wysokie-crp-,artykul,1659270.html#hs-crp-badanie-crp-o-wysokiej-czulosci.

———. 2020b. “Mocznik, Azot Mocznikowy - Badanie, Normy. Obniżone I Podwyższone Stężenie Mocznika.” https://www.medonet.pl/badania,mocznik-we-krwi--bun----norma-badania,artykul,1570136.html.

Karolina Karabin. 2020a. “Białko Całkowite - Normy W Badaniu Biochemicznym.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/bialko-calkowite-normy-w-badaniu-biochemicznym-aa-tj5J-sRC7-u7oy.html.

Katarzyna Pawlikowska-Łagód, lek. Beata Wańczyk-Dręczewska. 2020a. “MCHC (średnie Stężenie Hemoglobiny W Krwince Czerwonej) - Normy, Nadmiar, Leczenie.” https://www.medonet.pl/badania,mchc---srednie-stezenie-hemoglobiny-w-krwince-czerwonej,artykul,1570107.html#mchc-ponizej-normy.

———. 2020b. “MCHC (średnie Stężenie Hemoglobiny W Krwince Czerwonej) - Normy, Nadmiar, Leczenie.” https://www.medonet.pl/badania,mchc---srednie-stezenie-hemoglobiny-w-krwince-czerwonej,artykul,1570107.html.

“Kiła (Treponema Pallidum), Vdrl, Monitorowanie Leczenia.” 2020. https://diag.pl/sklep/badania/kila-treponema-pallidum-vdrl/.

Klaudia Knap. 2020a. “Czas Protrombinowy I Inr.” https://www.mp.pl/pacjent/badania_zabiegi/152270,czas-protrombinowy-i-inr.

———. 2020b. “Troponiny Sercowe.” https://www.mp.pl/pacjent/badania_zabiegi/152291,troponiny-sercowe.

Kornel Gajewski. 2020a. “Chlorki.” https://www.mp.pl/pacjent/badania_zabiegi/171833,chlorki.

———. 2020b. “Dehydrogenaza Mleczanowa (Ldh).” https://www.mp.pl/pacjent/badania_zabiegi/171840,dehydrogenaza-mleczanowa-ldh.

———. 2020c. “Gamma-Glutamylotranspeptydaza (Ggtp).” https://www.mp.pl/pacjent/badania_zabiegi/137074,gamma-glutamylotranspeptydaza-ggtp.

“Kreatynina - Czym Jest? Badanie Kreatyniny I Normy.” 2020. https://lekarzebezkolejki.pl/blog/kreatynina-czym-jest-badanie-kreatyniny-i-normy/w-1328.

Leiwen Fu, Tanwei Yuan, Bingyi Wang. 2020. “Clinical Characteristics of Coronavirus Disease 2019 (Covid-19) in China: A Systematic Review and Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32283155/.

Li Yan, [...], Hai-Tao Zhang. 2020. “An Interpretable Mortality Prediction Model for Covid-19 Patients.” Nature Machine Intelligence. https://www.nature.com/articles/s42256-020-0180-7.

Majewska, Monika. 2020a. “Neutrofile: Normy. Podwyższony, Obniżony Poziom Neut.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/neutrofile-normy-podwyzszony-obnizony-poziom-neut-aa-EzVK-Qfpg-QhBs.html.

Mariusz Basiura. 2020. “Eozynofile (Eozynocyty, Eos): Podwyższone Lub Poniżej Normy.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/eozynofile-eozynocyty-eos-podwyzszone-lub-ponizej-normy-aa-sAcV-QXWn-EJti.html.

Marlena Kostyńska, lek. Aleksandra Witkowska. 2020. “Leukocyty. Kiedy Wykonać Badanie? O Czym świadczą Wyniki?” https://www.medonet.pl/badania,leukocyty---rodzaje--badanie-krwi--normy--obecnosc-w-moczu,artykul,1570110.html.

Marlena Kostyńska, lek. Beata Wańczyk-Dręczewska. 2020. “Czas Kaolinowo-Kefalinowy (Aptt). Pomiar Czasu Częściowej Tromboplastyny Po Aktywacji.” https://www.medonet.pl/badania,czas-kaolinowo-kefalinowy--czas-czesciowej-tromboplastyny-po-aktywacji--ang--activated-partial-thromboplastin-time--aptt-,artykul,1570157.html.

Marlena Kostyńska, lek. Paweł Żmuda-Trzebiatowski. 2020a. “Czas Trombinowy (Tt) – Krzepnięcie Krwi, Przebieg Badania, Odstępstwa Od Normy.” https://www.medonet.pl/badania,czas-trombinowy--tt----krzepniecie-krwi--przebieg-badania--odstepstwa-od-normy,artykul,1734547.html.

———. 2020b. “Fibrynogen – Kiedy Oznaczamy, Przebieg Badania, Odstępstwa Od Normy.” https://www.medonet.pl/badania,fibrynogen---kiedy-oznaczamy--przebieg-badania--odstepstwa-od-normy,artykul,1570274.html.

Marta Pawlak, lek. Aleksandra Witkowska. 2020a. “Antytrombina - Kiedy Sprawdzić Jej Aktywność I Stężenie?” https://www.medonet.pl/badania,antytrombina---kiedy-sprawdzic-jej-aktywnosc-i-stezenie-,artykul,1734857.html.

———. 2020b. “RDW-Sd - Oznaczanie, Wskazania, Norma, Analiza Wyników.” https://www.medonet.pl/zdrowie,rdw-sd---oznaczanie--wskazania--norma--analiza-wynikow,artykul,1733018.html.

Mazur, Justyna. 2020. “TNF Alfa (Czynnik Martwicy Nowotworu) – Badanie, Norma, Leczenie.” https://wylecz.to/badania-laboratoryjne/tnf-alfa-czynnik-martwicy-nowotworu-badanie-norma-leczenie/.

“MCV (średnia Objętość Erytrocytów) - Normy, Wysoki I Niski Poziom.” 2020. https://lekarzebezkolejki.pl/blog/mcv-srednia-objetosc-erytrocytow-normy-wysoki-i-niski-poziom/w-1327.

Monika Dąbrowska. 2020. “Limfocyty - Czym Są, Jakie Są Normy, Co Oznacza Spadek Poziomu Limfocytów.” https://enel.pl/enelzdrowie/zdrowie/limfocyty-czym-sa-jakie-sa-normy-co-oznacza-spadek-poziomu-limfocytow.

Monika Wasilonek, lek. Aleksandra Witkowska. 2020. “Badanie P-Lcr - Wskazania Do Wykonania I Normy.” https://www.medonet.pl/zdrowie,badanie-p-lcr---wskazania-do-wykonania-i-normy,artykul,1733026.html#badanie-p-lcr-normy.

Morzy, Tadeusz. 2013. Eksploracja Danych. Metody I Algorytmy.

“Neutrofile Poniżej Normy. Jakie Są Przyczyny?” 2020. https://lekarzebezkolejki.pl/blog/neutrofile-ponizej-normy-jakie-sa-przyczyny/w-614.

Nowacka, Mirosława. 2020. “Wapń Całkowity, Zjonizowany I Skorygowany.” http://odiagnostyce.pl/2016/08/20/wapn-calkowity-zjonizowany-i-skorygowany.

“NT Pro-Bnp – Charakterystyka, Wskazania Do Badania, Przeprowadzenie Badania, Norma, Interpretacja Wyników.” 2020. https://portal.abczdrowie.pl/nt-pro-bnp-charakterystyka-wskazania-do-badania-przeprowadzenie-badania-norma-interpretacja-wynikow.

Paculanka, Agnieszka. 2020. “OB Czyli Odczyn Biernackiego. Jakie Są Normy Ob?” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/ob-odczyn-biernackiego-jakie-sa-normy-ob-aa-ew6R-HroV-s9b2.html.

“PDW (Morfologia Krwi) - Norma, Podwyższone I Obniżone Pdw.” 2020. https://lekarzebezkolejki.pl/blog/pdw-morfologia-krwi-norma-podwyzszone-i-obnizone-pdw/w-1452.

Socha-Chyrzyński, Natasza. 2020. “Gazometria: Wyniki I Normy pH Krwi.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/gazometria-wyniki-i-normy-ph-krwi-aa-6EWC-V3wE-QRN5.html.

Tomasz Nęcki. 2020a. “Monocyty (Mono) - Rola, Norma, Nadmiar I Niedobór.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/monocyty-mono-rola-norma-nadmiar-i-niedobor-aa-FvLD-BrBi-DChY.html.

———. 2020b. “Wskaźnik Rdw Sd (Wskaźnik Rozkładu Objętości Erytrocytów) - Normy.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/wskaznik-rdw-sd-wskaznik-rozkladu-objetosci-erytrocytow-normy-aa-9gN7-txfr-Ahbo.html.

Zuzanna Łagódka. 2020. “Gazometria - Czym Jest to Badanie?” https://enel.pl/enelzdrowie/zdrowie/gazometria-czym-jest-to-badanie.